Abstract

The mean-shift algorithm is a popular algorithm in computer vision
and image processing. It can also be cast as a minimum γ-divergence
estimation. In this paper we focus on the "blurring" mean-shift algorithm,
which is one version of the mean-shift process that successively blurs the
dataset. The analysis of the blurring mean-shift is more complicated
than that of the nonblurring version, yet the algorithm convergence
and the estimation consistency have not been well studied in the literature.
In this paper we prove both the convergence and the consistency of
the blurring mean-shift. We also perform simulation studies to compare
the efficiency of the blurring and the nonblurring versions of the
mean-shift algorithm. Our results show that the blurring mean-shift is
more efficient.

Keywords: Mean-shift, Convergence, Consistency, Clustering, γ-divergence,
Super robustness.



1 Introduction 

The mean-shift algorithm is a popular algorithm in computer vision and image
processing. It was initially designed for kernel density estimation (Fukunaga
and Hostetler, 1975), and iteratively uses the sample mean within a local region
to estimate the gradient of a density function. The mean-shift algorithm was
further extended and analyzed by Cheng (1995). Comaniciu and Meer (2002) later
applied the mean-shift algorithm to the problem of image segmentation. Since then
the algorithm has become better known in the computer science community
than in the statistics community. For more related works on the mean-shift
algorithm, see Fashing and Tomasi (2005) and Carreira-Perpiñán (2006, 2007). In
recent years, methods that use iterative processes to minimize the γ-divergence
were proposed for robust parameter estimation (Fujisawa and Eguchi, 2008) and
for robust clustering (Chen et al., 2012). These methods can also be viewed as
mean-shift based approaches.

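Suppose $S = \{x_1, \ldots, x_N\}$ are sample points and $T = \{y_1, \ldots, y_m\}$ are
cluster centers. The nonblurring mean-shift updating rule can be defined as
follows:

\[
y_j^{(t+1)} = \frac{\sum_{k=1}^{N} f(y_j^{(t)} - x_k)\, w(x_k)\, x_k}{\sum_{k=1}^{N} f(y_j^{(t)} - x_k)\, w(x_k)}, \tag{1}
\]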



where $f$ is a kernel function, $w$ is a weight function, and $y_j^{(0)} = y_j$. The
convergence of the nonblurring version of mean-shift was studied in Cheng (1995),
Comaniciu and Meer (2002), and Li et al. (2007).

When $T = S$, the updating rule becomes



\[
x_i^{(t+1)} = \frac{\sum_{k=1}^{N} f(x_i^{(t)} - x_k^{(t)})\, w(x_k^{(t)})\, x_k^{(t)}}{\sum_{k=1}^{N} f(x_i^{(t)} - x_k^{(t)})\, w(x_k^{(t)})}, \tag{2}
\]



where $x_i^{(0)} = x_i$. This is called the blurring mean-shift. Note that the weighted
average is over the updated data points, instead of the original data. The
convergence analysis of the blurring mean-shift is therefore more complicated
than that of the nonblurring one. Cheng (1995) proved the convergence of the blurring
mean-shift algorithm for the following two limited cases. When the mutual
influence between each pair of data points is nonzero, Theorem 3 in Cheng
(1995) showed that all data points eventually converge to a single cluster. When
in practice the iterative process is simulated by a digital computer such that
data points can never go arbitrarily close to each other, Theorem 4 in Cheng
(1995) guaranteed that the algorithm converges in a finite number of steps. In
Section 2, we show that there is a gap in the proof of Theorem 4 in Cheng
(1995). We also discuss related work and the conditions on $f$ and $w$.
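The difference between the two updating rules can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the paper: a Gaussian $f$ with bandwidth `tau`, unit weights, and the function names `nonblurring_step` and `blurring_step`, all chosen here for the example.

```python
import numpy as np

def gaussian_kernel(diff, tau=1.0):
    """Illustrative f: depends only on the distance ||diff||."""
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * tau ** 2))

def nonblurring_step(centers, data, weights, f=gaussian_kernel):
    """One update of rule (1): the centers move, the data stay fixed."""
    # k[j, k] = f(y_j - x_k) * w_k
    k = f(centers[:, None, :] - data[None, :, :]) * weights[None, :]
    return (k @ data) / k.sum(axis=1, keepdims=True)

def blurring_step(points, weights, f=gaussian_kernel):
    """One update of rule (2): the data themselves are replaced (T = S)."""
    return nonblurring_step(points, points, weights, f)

data = np.array([[0.0], [0.2], [0.4], [3.0], [3.2]])
w = np.ones(len(data))

y = data.copy()   # nonblurring: centers start at the data, data stay fixed
x = data.copy()   # blurring: the data are overwritten at each iteration
for _ in range(3):
    y = nonblurring_step(y, data, w)
    x = blurring_step(x, w)
```

The first iteration of the two rules coincides when the centers are initialized at the data; they differ from the second iteration on, because the blurring rule averages over the already updated points.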



In Section 3, we present a more general result on the convergence of the
blurring mean-shift algorithm than Theorem 4 in Cheng (1995). The convergence of
the blurring mean-shift is guaranteed under the general definition: data points
eventually become arbitrarily close to some locations. Since the number of data
points is always finite, there exists a common $t^*$ such that each data point is
close enough to where it converges after the $t^*$-th iteration. That is to say,
convergence under the general definition implies convergence in a finite
number of steps subject to floating point precision. In addition, Theorem 3 in
Cheng (1995) is an immediate implication of our result, which is stated in our
Corollary 1.

While the mean-shift algorithm was originally designed for mode seeking using
kernel density estimation, it is unclear whether this estimation produces
results that converge to the true parameter values when the number of data
points goes to infinity. Windham (1995) proposed a robust model fitting procedure,
which can be viewed as a nonblurring approach. Fujisawa and Eguchi (2008) proposed
a robust estimation by minimizing the γ-divergence and proved the consistency of
their proposed estimation. This is also a nonblurring approach. In the literature,
the consistency of blurring processes has not been well studied. We present the
consistency of the blurring processes in Section 4.

In addition to convergence and consistency, in Section 5 we present
simulation studies to compare the performance of the blurring and the nonblurring
processes. Discussions and conclusions are given in Section 6.

In this section we present a proof of the convergence of the blurring
mean-shift process. We will first discuss related work, and introduce some conditions
on $f$ and $w$ in (2).



2 Related Work and Conditions 



Before we start the proof of convergence, it is necessary to comment on
the related works of Cheng (1995) and Chen and Shiu (2007).



2.1 A Gap in the Proof of Theorem 4 in Cheng (1995)



As mentioned in the previous section, there is a gap in the proof of Theorem 4
in Cheng (1995). We quote from the proof of Theorem 4 in Cheng (1995):



Lemma 2 says that the radius of data reaches its final value in 
finite number of steps. Lemma 2 also implies that those points at this 
final radius will not affect other data points or each other. Hence, 
they can be taken out from consideration for further process of the 
algorithm. 

This implication of Cheng's Lemma 2 is questionable in two respects. First,
when the radius of data points reaches its final value, it is not trivial to conclude
that there do not exist two data points alternately switching their locations
to be at the final radius, meaning that data points in such a situation fail
to converge. Although this situation will not happen during the mean-shift
iterative process, this needs to be proven. See our Lemma 2 and its proof.

Second, the convergence of some points at the final radius does not imply
that these points do not affect other points. Although these points no longer
move, it is possible that they still receive influences from other points, which
are just too small to induce a move larger than the floating point precision. The
accumulated influences from these converged data points at the same location
may be large enough to affect other data points and to move them
differently. Therefore, these converged data points should not be immediately taken
out of the further process of the algorithm.



2.2 The weight function w

It was stated in Cheng (1995) that the weight function $w$ can be either fixed
through the process or re-evaluated after each iteration, but the convergence was
only studied for the case when $w$ is fixed. In fact, we found that the process does
not converge for arbitrary $w$'s that change over the iterations. The following
example illustrates this.
Example 1. Assume the number of data points is 3. Let $x_1 = \delta_1$, $x_2 = 1/2 + \delta_2$,
$x_3 = -1/2 - \delta_3$, where $0 < \delta_i < 1/4$. Let

\[
f(d) = \begin{cases} 1, & d = 0, \\ 1/2, & 0 < d < 1, \\ 0, & 1 \le d. \end{cases}
\]

Since $x_2 - x_3 > 1$, $f(x_2 - x_3) = 0$, meaning that $x_2$ and $x_3$ do not influence
each other in the next update. Let $w(x) = 1$ for $-1/2 < x < 1/2$. Therefore,
$w(x_1) = 1$. Now we can assign large values to $w(x_2)$ and $w(x_3)$ so that

\[
x_2^{(1)} = \frac{w(x_2)\, x_2 + x_1/2}{w(x_2) + 1/2} > \frac{1}{2},
\qquad
x_3^{(1)} = \frac{w(x_3)\, x_3 + x_1/2}{w(x_3) + 1/2} < -\frac{1}{2}.
\]

We can also assign a large enough value to $w(x_3)$, so that

\[
x_1^{(1)} = \frac{x_1 + w(x_2)\, x_2/2 + w(x_3)\, x_3/2}{1 + w(x_2)/2 + w(x_3)/2} < 0.
\]

These inequalities show that after the first update, $x_1^{(1)}$ becomes negative, and
$x_2^{(1)}$ and $x_3^{(1)}$ remain outside $[-1/2, 1/2]$.

At each iteration, we can assign large enough values to $w(x_2)$ and $w(x_3)$,
so that $x_1^{(t)}$ is positive when $t$ is even and is negative when $t$ is odd. We can
further control the absolute value of $x_1^{(t)}$ to be away from zero, so that $x_1^{(t)}$
and consequently the whole system do not converge. Note that $x_2^{(t)}$ and $x_3^{(t)}$ do
converge in this case.

Having seen the above example, in the next section we only prove the
convergence under the condition that the $w(x_i^{(t)})$'s are fixed throughout the process,
meaning that the $w(x_i^{(t)})$'s depend only on $i$. It is worth noting that the convergence of
the iterative process in fact also holds for varying $w(x_i^{(t)})$'s with $\lim_t w(x_i^{(t)})$
existing for each $i$.
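The first update of Example 1 can be checked numerically. The sketch below is only an illustration: the value `delta = 0.1` and the weights `10` and `100` are hypothetical choices that satisfy the requirements stated in the example, and only the first iteration is verified.

```python
def f(d):
    """Influence function of Example 1, as a function of the distance d >= 0."""
    if d == 0:
        return 1.0
    return 0.5 if d < 1 else 0.0

def update(x, w):
    """One blurring update (2) for scalar points x with weights w."""
    new = []
    for xi in x:
        k = [f(abs(xi - xk)) * wk for xk, wk in zip(x, w)]
        new.append(sum(ki * xk for ki, xk in zip(k, x)) / sum(k))
    return new

delta = 0.1                               # any value in (0, 1/4) would do here
x = [delta, 0.5 + delta, -0.5 - delta]    # x1, x2, x3
w = [1.0, 10.0, 100.0]                    # hypothetical weights; w(x3) dominates
x_new = update(x, w)                      # x1 turns negative, x2 and x3 stay outside [-1/2, 1/2]
```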

2.3 The influence function f

Since the mean-shift algorithm was originally developed for kernel density
estimation, it is natural to require $f$ in (2) to be integrable. A weaker condition on
$f$, however, suffices to guarantee the convergence of the iterative process.



Chen and Shiu (2007) proposed a self-updating process (SUP) for clustering
as follows:

(i) $x_1^{(0)}, \ldots, x_N^{(0)} \in R^p$ are data points to be clustered.

(ii) At time $t + 1$, every point is updated to

\[
x_i^{(t+1)} = \sum_{j=1}^{N} \frac{f(x_i^{(t)}, x_j^{(t)})}{\sum_{k=1}^{N} f(x_i^{(t)}, x_k^{(t)})}\, x_j^{(t)}, \tag{3}
\]

where $f$ is some function that measures the influence between two data
points at time $t$.

(iii) Repeat (ii) until every point converges.

Although not specified in the notation, the $f$ function in (3) is allowed to be
inhomogeneous with respect to $t$. That is to say, it is more general than
the $f$ function in the mean-shift updating rule in (2). Chen and Shiu (2007)
demonstrated the use of inhomogeneous $f$'s in several of their experiments.
The $f$ function in (3) is not required to be integrable. It is proposed to satisfy
the following PDD condition.

Definition 1. The function $f$ in (3) is PDD (positive and decreasing with
respect to distance), if

(i) $0 \le f(u, v) \le 1$, and $f(u, v) = 1$ if and only if $u = v$.

(ii) $f(u, v)$ depends only on $\|u - v\|$, the distance from $u$ to $v$.

(iii) $f(u, v)$ is decreasing with respect to $\|u - v\|$.

Note that $f$ in (2) is already defined to depend only on $u - v$. In the
following, we will prove the convergence under the conditions that (i) $f$ is PDD and
(ii) $w(x_i^{(t)})$ depends only on $i$.



3 Convergence 

Theorem 1. If the function $f$ in (2) is PDD, and if the weight function
$w(x_j^{(t)}) = w_j$ in (2) depends only on $j$, there exists $\{x_1^*, \ldots, x_N^*\}$ such that

\[
\lim_{t \to \infty} x_i^{(t)} = x_i^* \quad \forall i.
\]

Below we outline the proof of Theorem 1.

• First, consider the convex hull of all data points in each iteration. The
convex hulls with respect to iterations are nested (Lemma 1) and converge.

• Next, for each vertex of the converged convex hull, there exists at least one
sequence of the updated data points converging to this vertex (Lemma 2).

• The influence of the converged data points at the vertices of the
converged convex hull on the other data points goes down to zero (Lemma 3).

• Consider the convex hull of the remaining data points (excluding those already
converged). Using the same arguments again, we obtain a few more
converged data points. We can repeat this process until
all data points converge.

Definition 2. The convex hull C(X) for a set of points X in a vector space V 
is the minimal convex set containing X . 

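Lemma 1. Let $C_1^{(t)}$ be the convex hull of $\{x_1^{(t)}, \ldots, x_N^{(t)}\}$. Then

\[
C_1^{(t)} \supseteq C_1^{(t+1)}.
\]

Proof. The convex hull $C(X)$ for a set of points $X$ is the minimal convex set
containing $X$. Since

\[
x_i^{(t+1)} = \frac{\sum_{j=1}^{N} f(x_i^{(t)} - x_j^{(t)})\, w_j\, x_j^{(t)}}{\sum_{j=1}^{N} f(x_i^{(t)} - x_j^{(t)})\, w_j},
\]

$x_i^{(t+1)}$ is a weighted average of $x_j^{(t)}$ for $j = 1, \ldots, N$. Therefore, $x_i^{(t+1)} \in C_1^{(t)}$.
Since the above is true for each $i$, we have

\[
C_1^{(t)} \supseteq C(\{x_1^{(t+1)}, \ldots, x_N^{(t+1)}\}) = C_1^{(t+1)}.
\]

□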
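The nesting in Lemma 1 is easy to observe numerically. The following sketch works in one dimension, where the convex hull is simply the interval from the smallest to the largest point. The Gaussian $f$, the random data and the fixed random weights are assumptions made for this illustration only.

```python
import numpy as np

def blurring_step(x, w, tau=1.0):
    """One blurring update (2) in one dimension with a Gaussian f (an assumed choice)."""
    k = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * tau ** 2)) * w[None, :]
    return (k @ x) / k.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=20)
w = rng.uniform(0.5, 2.0, size=20)   # fixed weights w_j, as Theorem 1 requires

hulls = []                           # in 1-D the convex hull is the interval [min, max]
for _ in range(10):
    hulls.append((x.min(), x.max()))
    x = blurring_step(x, w)
hulls.append((x.min(), x.max()))

# Each hull must contain the next one (up to rounding error).
nested = all(lo0 <= lo1 + 1e-12 and hi1 <= hi0 + 1e-12
             for (lo0, hi0), (lo1, hi1) in zip(hulls, hulls[1:]))
```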

Note that the nested structure presented in Lemma 1 ensures the convergence
of the convex hulls $\{C_1^{(t)}\}$. Let $C_1$ be the limit of $C_1^{(t)}$,

\[
C_1 = \lim_{t \to \infty} C_1^{(t)} = \bigcap_{t=0}^{\infty} C_1^{(t)}.
\]

On the other hand, since the convex hull of any finite set of points in $R^p$ is a
polytope, each $C_1^{(t)}$ is a polytope. Each vertex of $C_1^{(t)}$ therefore must coincide with at
least one $x_i^{(t)}$ for some $i$, otherwise the polytope would have been smaller. With
the convergence of the convex hulls $\{C_1^{(t)}\}$, Lemma 2 claims that for each vertex of
$C_1$, there exists at least one sequence of $\{x_i^{(t)}\}$ which converges to this vertex.

Lemma 2. If the function $f$ in (2) is PDD, for each vertex $v_{1,i}$ of $C_1$, there
exists at least one $j$, such that

\[
\lim_{t \to \infty} x_j^{(t)} = v_{1,i}. \tag{4}
\]
Proof. Since $C_1 = \lim_{t \to \infty} C_1^{(t)}$, for each $i$ there exists a sequence of $v_{1,i}^{(t)}$'s
(exchange vertex indices if necessary), such that $\lim_{t \to \infty} v_{1,i}^{(t)} = v_{1,i}$, where $v_{1,i}^{(t)}$ is a
vertex of $C_1^{(t)}$. Since for any $t$ and $i$, $v_{1,i}^{(t)} = x_k^{(t)}$ for at least one $k$, there exists
$j$ such that $x_j^{(t)} = v_{1,i}^{(t)}$ for infinitely many $t$'s. Therefore, there exists an infinite
time sequence of $t_n$'s, such that

\[
x_j^{(t_n)} = v_{1,i}^{(t_n)} \quad \forall n,
\]

which leads to

\[
\lim_{n \to \infty} x_j^{(t_n)} = v_{1,i}.
\]

If $x_j^{(t)} = v_{1,i}^{(t)}$ except for finitely many $t$, then equation (4) is established. Otherwise,
there exists $j' \ne j$ and another infinite time sequence of $s_n$'s, such that

\[
x_{j'}^{(s_n)} = v_{1,i}^{(s_n)} \quad \forall n.
\]

Without loss of generality, assume that $v_{1,i}^{(t)} = x_j^{(t)}$ or $x_{j'}^{(t)}$ for all $t > t_0$. Assume
$w_j \ge w_{j'}$. From equation (3), if $x_j^{(s)} = x_{j'}^{(s)}$ for some $s$, then $x_j^{(t)} = x_{j'}^{(t)}$ for all
$t > s$. Therefore, for any $s > 0$, there exists $t > s$, such that $v_{1,i}^{(t)} = x_j^{(t)}$ and
$v_{1,i}^{(t+1)} = x_{j'}^{(t+1)}$. We claim that this case, however, can never happen: when $t$
is large enough, it is impossible that a data point inside the convex hull later
becomes a new vertex, since it is closer to other points than the current vertex
is. In the following we prove this claim only for the one-dimensional case. For
higher dimensional cases, consider the supporting hyperplane containing $v_{1,i}$.
Since $v_{1,i}$ is a vertex of a convex set, a supporting hyperplane can be chosen
such that no other point is in the hyperplane. Now we can project all data points
onto the straight line which is perpendicular to the supporting hyperplane
and passes through $v_{1,i}$. Then we can make the same argument on the projected
data points.

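Without loss of generality, assume $v_{1,i} = 0$, $x_j^{(t)} < 0$, and $x_k^{(t)} > 0$ for $k \ne j$
or $j'$. If $x_{j'}^{(t+1)}$ later becomes the new vertex, then

\[
\frac{\sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)}}{\sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k}
<
\frac{\sum_{k=1}^{N} f(x_j^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)}}{\sum_{k=1}^{N} f(x_j^{(t)} - x_k^{(t)})\, w_k}. \tag{5}
\]

Moreover, since $x_{j'}^{(t+1)}$ is the new vertex,

\[
\frac{\sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)}}{\sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k} < 0.
\]

Since $x_j^{(t)}$ is the current vertex, $\|x_j^{(t)} - x_k^{(t)}\| \ge \|x_{j'}^{(t)} - x_k^{(t)}\|$ for any $k$, and hence
$f(x_j^{(t)} - x_k^{(t)}) \le f(x_{j'}^{(t)} - x_k^{(t)})$. Then

\[
\begin{aligned}
\sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)}
&= w_{j'} x_{j'}^{(t)} + f(x_{j'}^{(t)} - x_j^{(t)})\, w_j\, x_j^{(t)} + \sum_{k \ne j, j'} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)} \\
&> w_j x_j^{(t)} + f(x_j^{(t)} - x_{j'}^{(t)})\, w_{j'}\, x_{j'}^{(t)} + \sum_{k \ne j, j'} f(x_j^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)} \\
&= \sum_{k=1}^{N} f(x_j^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)}.
\end{aligned}
\]

Since

\[
\sum_{k=1}^{N} f(x_j^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)} < \sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)} < 0
\]

and

\[
0 < \sum_{k=1}^{N} f(x_j^{(t)} - x_k^{(t)})\, w_k < \sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k,
\]

we have

\[
\frac{\sum_{k=1}^{N} f(x_j^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)}}{\sum_{k=1}^{N} f(x_j^{(t)} - x_k^{(t)})\, w_k}
<
\frac{\sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k\, x_k^{(t)}}{\sum_{k=1}^{N} f(x_{j'}^{(t)} - x_k^{(t)})\, w_k},
\]

which is a contradiction to (5). □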

Having shown that at least some points converge under the iterative
updates, hereafter we consider the rest of the data points. Let $\Omega_1$ be the set of
points shown to converge to the vertices of $C_1$. Define $C_2^{(t)}$ to be the convex hull
of $\{x_i^{(t)}\}_{x_i \notin \Omega_1}$. Note that $\{C_2^{(t)}\}$ may not be nested at early stages of the iterations:
points not in $\Omega_1$ may move outside the current convex hull $C_2^{(t)}$ due to the
influence from $\Omega_1$, so the volume of the convex hull may increase from one iteration to the next.
This nested property, however, holds after some iteration when all data
points in $\Omega_1$ converge. Explicitly,

\[
C_2^{(t)} \supseteq C_2^{(t+1)} \quad \forall t > \tilde{t} \text{ for some } \tilde{t},
\]

which also implies the convergence of $\{C_2^{(t)}\}$,

\[
C_2 = \lim_{t \to \infty} C_2^{(t)}.
\]

We introduce the following Lemma 3, which leads to the nested property of
$\{C_2^{(t)}\}$. It states that when all data points in $\Omega_1$ converge, points in $\Omega_1$ receive
no influence from points not in $\Omega_1$, otherwise they would have been attracted
inwards. That is to say, data points not in $\Omega_1$ also no longer receive influence
from points in $\Omega_1$, meaning that the influence from points in $\Omega_1$ goes down to
zero.

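Lemma 3. For an arbitrary $x_i \in \Omega_1$, we have

\[
\lim_{t \to \infty} f(x_i^{(t)} - x_j^{(t)}) = 0,
\]

for all $j$ such that $\lim_{t \to \infty} x_j^{(t)} \ne \lim_{t \to \infty} x_i^{(t)}$.

Proof. Without loss of generality, assume that $x_i^{(t)}$ is the only data point that
converges to $v_{1,i}$. Then

\[
\begin{aligned}
& \frac{\sum_{j=1}^{N} f(x_i^{(t)} - x_j^{(t)})\, w_j\, x_j^{(t)}}{\sum_{j=1}^{N} f(x_i^{(t)} - x_j^{(t)})\, w_j} = x_i^{(t+1)} \\
\Rightarrow\ & \sum_{j=1}^{N} f(x_i^{(t)} - x_j^{(t)})\, w_j \cdot (x_j^{(t)} - x_i^{(t+1)}) = 0 \\
\Rightarrow\ & \sum_{j \ne i} f(x_i^{(t)} - x_j^{(t)})\, w_j \cdot (x_j^{(t)} - x_i^{(t+1)}) = w_i \cdot (x_i^{(t+1)} - x_i^{(t)}). \qquad (6)
\end{aligned}
\]

Since $x_i^{(t)}$ converges to $v_{1,i}$, $x_i^{(t+1)}$ and $x_i^{(t)}$ become arbitrarily close to each
other when $t$ is large enough. That is, the right-hand side of (6) goes down to
zero. On the other hand, since $x_j^{(t)}$ does not converge to $v_{1,i}$ for $j \ne i$, there
is a gap between $x_j^{(t)}$ and $x_i^{(t+1)}$. To force the left-hand side of (6) to be zero,
$f(x_i^{(t)} - x_j^{(t)})$ must go down to zero as well. This sketches the proof of Lemma
3. The precise details are given in the following.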

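Because $x_j^{(t)}$ does not converge to $v_{1,i}$ for $j \ne i$, there exists $\epsilon > 0$ such that, for any
$t_0 > 0$, there exists $t > t_0$ with $\|x_j^{(t)} - v_{1,i}\| > \epsilon$. In fact, $x_j^{(t)}$ cannot go
arbitrarily close to $v_{1,i}$ when $t$ is large enough, otherwise the updating process
will move $x_i^{(t)}$ and $x_j^{(t)}$ closer and closer to each other. That is, there exist
$\epsilon_1 > 0$ and $t_1$ such that $\|x_j^{(t)} - v_{1,i}\| > \epsilon_1$ for all $t > t_1$. On the other hand,
because $x_i^{(t)} \to v_{1,i}$, for any $\epsilon_2 > 0$ there exists $t_2$ such that $\|x_i^{(t)} - x_i^{(t+1)}\| < \epsilon_2$
for $t > t_2$.

Since $v_{1,i}$ is a vertex of the convex set $C_1$, there exists $x \in C_1$, such that the
inner product of $x - v_{1,i}$ and $y - v_{1,i}$ is positive for any $y \in C_1$. Let

\[
v_x = \frac{x - v_{1,i}}{\|x - v_{1,i}\|}.
\]

There exist $\alpha > 0$ and $t_3 > t_1$ such that

\[
\langle x_j^{(t)} - v_{1,i}, v_x \rangle > \alpha \|x_j^{(t)} - v_{1,i}\| \quad \forall t > t_3 \text{ and } \forall j \ne i,
\]

where $\langle \cdot, \cdot \rangle$ denotes the inner product. Taking the inner product of both sides of
(6) with $v_x$, we have

\[
\begin{aligned}
\Big\langle \sum_{j \ne i} f(x_i^{(t)} - x_j^{(t)})\, w_j \cdot (x_j^{(t)} - x_i^{(t+1)}),\ v_x \Big\rangle
&= \sum_{j \ne i} f(x_i^{(t)} - x_j^{(t)})\, w_j \big( \langle x_j^{(t)} - v_{1,i}, v_x \rangle + \langle v_{1,i} - x_i^{(t+1)}, v_x \rangle \big) \\
&> \min_j w_j \cdot \alpha \epsilon_1 \sum_{j \ne i} f(x_i^{(t)} - x_j^{(t)})
\end{aligned}
\]

for $t > t_3$, and

\[
\langle x_i^{(t+1)} - x_i^{(t)}, v_x \rangle \le \|x_i^{(t+1)} - x_i^{(t)}\| < \epsilon_2
\]

for $t > t_2$. Therefore, for $t > \max(t_2, t_3)$,

\[
\min_j w_j \cdot \alpha \epsilon_1 \sum_{j \ne i} f(x_i^{(t)} - x_j^{(t)}) < w_i\, \epsilon_2.
\]

Since $\epsilon_2$ can be arbitrarily small, the inequality above implies

\[
\sum_{j \ne i} f(x_i^{(t)} - x_j^{(t)}) \to 0.
\]

Since $f \ge 0$, $f(x_i^{(t)} - x_j^{(t)}) \to 0$ for all $j \ne i$. □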

From the above, we can claim a result for $C_2$ similar to Lemma 2 for $C_1$: each
vertex of $C_2$ has at least one data point converging to it. The same argument
can be applied again and again to $C_3, C_4, \ldots$, until all data points converge. This
completes the proof of Theorem 1.

Although Theorem 1 guarantees the convergence when $f$ satisfies the PDD condition,
there are some $f$'s that produce trivial clustering results, in which all data
points are clustered into one single group. We identify such $f$'s in the following
corollary.

Corollary 1. Let $r_M = \max_{i,j} \{\|x_i - x_j\|\}$. If $f$ is PDD with $f(r_M) > 0$, then
there exists $c$, such that

\[
\lim_{t \to \infty} x_i^{(t)} = c \quad \forall i.
\]

Proof. Lemma 1 implies that $\|x_i^{(t)} - x_j^{(t)}\| \le r_M$ for every $t$, $i$ and $j$. Since $f$ is
decreasing with respect to distance, $f(x_i^{(t)} - x_j^{(t)}) \ge f(r_M) > 0$. Lemma 3 shows,
however, that the influence between any two points which do not converge to
the same position tends to zero. Thus, $f(x_i^{(t)} - x_j^{(t)}) \ge f(r_M) > 0$ for every $i$
and $j$ implies that all data points converge to the same position. □

For the purpose of clustering, it is not desirable to have all data points
converge to the same position. To prevent trivial clustering results, $f$ has to
be zero on $(r, \infty)$ for some $r < r_M$.
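The effect of truncating $f$ can be illustrated on a small one-dimensional data set. The sketch below is an assumption-laden example, not the author's experiment: it uses $w = 1$, a Gaussian $f$ with $\tau = 1$, a hypothetical truncation radius `r = 1.0`, and hand-picked data with $r_M = 2.2$.

```python
import numpy as np

def run_blurring(x, f, n_iter=200):
    """Iterate the blurring rule (2) with w = 1 on 1-D data."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        k = f(np.abs(x[:, None] - x[None, :]))
        x = (k @ x) / k.sum(axis=1)
    return x

def gaussian(d, tau=1.0):
    """Gaussian influence: positive at every distance, so f(r_M) > 0."""
    return np.exp(-d ** 2 / (2.0 * tau ** 2))

def truncated(d, tau=1.0, r=1.0):
    """Gaussian shape cut to zero beyond distance r (r is an illustrative choice)."""
    return np.where(d < r, gaussian(d, tau), 0.0)

data = [0.0, 0.1, 0.2, 2.0, 2.1, 2.2]          # two groups; r_M = 2.2
one_cluster = run_blurring(data, gaussian)      # f(r_M) > 0: everything merges
two_clusters = run_blurring(data, truncated)    # f = 0 beyond r = 1 < r_M: groups stay apart
```

With the untruncated Gaussian all six points collapse to a single location, as Corollary 1 predicts, while the truncated version keeps the two groups separate.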



4 Consistency 

In the previous section, we proved the convergence of the algorithm. In this
section, we study the estimation consistency of the algorithm. We show the
consistency for the Normal case and remark on more general cases. The difficulty
of our consistency proof arises from the blurring process, i.e., the iterative data
shrinkage update.

Assume the $x_i$'s $\in R^p$ are i.i.d. samples from $N(0, \Sigma)$, and the mutual influence
function $f$ adopted is $\exp(-(x - y)^T (x - y)/2\tau^2)$, where $(x - y)^T$ is the transpose
of the vector $x - y$. Assume $w = 1$. The updating rule is:

\[
x_{i,n}^{(t+1)} = \frac{\sum_{j=1}^{n} f(x_{i,n}^{(t)} - x_{j,n}^{(t)})\, x_{j,n}^{(t)}}{\sum_{j=1}^{n} f(x_{i,n}^{(t)} - x_{j,n}^{(t)})}, \tag{7}
\]

where $x_{i,n}^{(t)}$ denotes the updated $x_i$ at the $t$-th iteration when considering only the first
$n$ samples. By Corollary 1 presented in the previous section, we know that for
all $i$

\[
\lim_{t \to \infty} x_{i,n}^{(t)} = c
\]

for the same $c$. Here we want to show that $c$ converges to zero almost surely,
which we state as the following theorem:

Theorem 2.

\[
\lim_{n \to \infty} \lim_{t \to \infty} x_{i,n}^{(t)} = 0 \quad a.s.
\]
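Proof. Let $G(x; \Sigma)$ be the CDF of $N(0, \Sigma)$, $G_n^{(t)}(x)$ be the empirical CDF of the
$n$-sample at the $t$-th iteration, and $G^{(t)}(x) = \lim_{n \to \infty} G_n^{(t)}(x)$. By the Glivenko-Cantelli
theorem,

\[
\lim_{n \to \infty} \sup_x |G_n^{(0)}(x) - G(x; \Sigma)| = 0 \quad a.s.
\]

We claim that the empirical distribution of the updated data points of each
iteration converges to a Normal distribution. In the following, we show that

\[
\lim_{n \to \infty} \sup_x |G_n^{(t)}(x) - G^{(t)}(x)| = 0 \quad a.s. \tag{8}
\]

where $G^{(t)}(x) = G(x; \Sigma_t)$. This is true for $t = 0$. Assume that it is true for
$t = s$; we want to show that it is true for $t = s + 1$. Assume that

\[
\sup_x |G_n^{(s)}(x) - G^{(s)}(x)| < \epsilon_s
\]

for $n > N_\epsilon$. Define

\[
K_H(x) = \frac{\int_y f(x - y) \cdot y \, dH(y)}{\int_y f(x - y) \, dH(y)}.
\]

With the assumption that $G^{(s)}(x) = G(x; \Sigma_s)$, we have

\[
\begin{aligned}
\int_y f(x - y)\, dG^{(s)}(y)
&= \int_y c_s \exp\Big(-\frac{(x - y)^T (x - y)}{2\tau^2}\Big) \cdot \exp\Big(-\frac{y^T \Sigma_s^{-1} y}{2}\Big)\, dy \\
&= c'_s(x) \int_y \exp\Big(-\frac{1}{2} \{y - (I + \tau^2 \Sigma_s^{-1})^{-1} x\}^T \Big(\frac{I}{\tau^2} + \Sigma_s^{-1}\Big) \{y - (I + \tau^2 \Sigma_s^{-1})^{-1} x\}\Big)\, dy.
\end{aligned}
\]

Therefore,

\[
K_{G^{(s)}}(x) = (I + \tau^2 \Sigma_s^{-1})^{-1} x. \tag{9}
\]

Since

\[
|G_n^{(s)}(x) - G^{(s)}(x)| < \epsilon_s
\]

and $f(x - y)y$ and $f(x - y)$ are bounded, we have

\[
\|K_{G_n^{(s)}}(x) - K_{G^{(s)}}(x)\|_2 < a_s \epsilon_s \tag{10}
\]

for some positive number $a_s$, where $\|\cdot\|_2$ is the $L^2$ norm. Since

\[
x_{i,n}^{(s+1)} = \frac{\sum_{j=1}^{n} f(x_{i,n}^{(s)} - x_{j,n}^{(s)})\, x_{j,n}^{(s)}}{\sum_{j=1}^{n} f(x_{i,n}^{(s)} - x_{j,n}^{(s)})}
= \frac{\int_y f(x_{i,n}^{(s)} - y) \cdot y \, dG_n^{(s)}(y)}{\int_y f(x_{i,n}^{(s)} - y) \, dG_n^{(s)}(y)}
= K_{G_n^{(s)}}(x_{i,n}^{(s)}),
\]

we have

\[
\|x_{i,n}^{(s+1)} - (I + \tau^2 \Sigma_s^{-1})^{-1} x_{i,n}^{(s)}\|_2
= \|K_{G_n^{(s)}}(x_{i,n}^{(s)}) - K_{G^{(s)}}(x_{i,n}^{(s)})\|_2
< a_s \epsilon_s.
\]

The empirical distribution of $x_{i,n}^{(s+1)}$ is $G_n^{(s+1)}(x)$, and that of $(I + \tau^2 \Sigma_s^{-1})^{-1} x_{i,n}^{(s)}$
is $G_n^{(s)}((I + \tau^2 \Sigma_s^{-1}) x)$. Then

\[
\begin{aligned}
& \big|G_n^{(s+1)}(x) - G^{(s)}((I + \tau^2 \Sigma_s^{-1}) x)\big| \\
&\le \max_{\|\Delta x\| < a_s \epsilon_s} \big|G_n^{(s)}((I + \tau^2 \Sigma_s^{-1})(x + \Delta x)) - G^{(s)}((I + \tau^2 \Sigma_s^{-1}) x)\big| \\
&\le \max_{\|\Delta x\| < a_s \epsilon_s} \Big\{ \big|G_n^{(s)}((I + \tau^2 \Sigma_s^{-1})(x + \Delta x)) - G^{(s)}((I + \tau^2 \Sigma_s^{-1})(x + \Delta x))\big| \\
&\qquad\qquad + \big|G^{(s)}((I + \tau^2 \Sigma_s^{-1})(x + \Delta x)) - G^{(s)}((I + \tau^2 \Sigma_s^{-1}) x)\big| \Big\} \\
&< \epsilon_s + \max_{\|\Delta x\| < a_s \epsilon_s} \big|G^{(s)}((I + \tau^2 \Sigma_s^{-1})(x + \Delta x)) - G^{(s)}((I + \tau^2 \Sigma_s^{-1}) x)\big| \\
&\le \epsilon_s + \max_{\|\Delta x\| < a_s \epsilon_s} \|(I + \tau^2 \Sigma_s^{-1}) \Delta x\|_2 \cdot \max_{\|\Delta x\| < a_s \epsilon_s} \Big\|\frac{\partial}{\partial x} G^{(s)}(x + \Delta x)\Big\|_2 \\
&< \epsilon_s + \lambda a_s \epsilon_s \cdot \frac{1}{2\pi\, |\det(I + \tau^2 \Sigma_s^{-1})|^{1/2}},
\end{aligned}
\]

where $\lambda$ is the largest eigenvalue of $I + \tau^2 \Sigma_s^{-1}$. Therefore, $|G_n^{(s+1)}(x) - G^{(s)}((I +
\tau^2 \Sigma_s^{-1}) x)|$ can be made arbitrarily small by choosing a small enough $\epsilon_s$. This
completes the induction.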
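The closed form (9) can be sanity-checked numerically in one dimension, where it reduces to $K_{G^{(s)}}(x) = x / (1 + \tau^2/\sigma_s^2)$. The sketch below is an illustration only: the grid-based integration, the function name `K`, and the values of `sigma`, `tau` and `x` are assumptions made for this example.

```python
import numpy as np

def K(x, sigma, tau, half_width=12.0, n=200001):
    """Numerically evaluate K_G(x) for G = N(0, sigma^2) and a Gaussian f in 1-D."""
    y = np.linspace(-half_width * sigma, half_width * sigma, n)
    f = np.exp(-(x - y) ** 2 / (2.0 * tau ** 2))   # influence f(x - y)
    g = np.exp(-y ** 2 / (2.0 * sigma ** 2))       # N(0, sigma^2) density up to a constant
    # On a uniform grid the spacing cancels in the ratio of the two integrals.
    return np.sum(f * y * g) / np.sum(f * g)

sigma, tau, x = 1.5, 2.0, 0.7
numeric = K(x, sigma, tau)
closed_form = x / (1.0 + tau ** 2 / sigma ** 2)    # (I + tau^2 Sigma^{-1})^{-1} x with p = 1
```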
From (9), we have

\[
\Sigma_{s+1} = (I + \tau^2 \Sigma_s^{-1})^{-1} \Sigma_s (I + \tau^2 \Sigma_s^{-1})^{-1}.
\]

Since $\Sigma_s$ is a covariance matrix, it is symmetric and positive definite. Then $\Sigma_s$
can be factorized as

\[
\Sigma_s = P \Lambda_s P^T,
\]

where $P P^T = I$ and $\Lambda_s$ is a diagonal matrix. Then

\[
\begin{aligned}
\Sigma_s^{-1} &= P \Lambda_s^{-1} P^T, \\
I + \tau^2 \Sigma_s^{-1} &= P (I + \tau^2 \Lambda_s^{-1}) P^T, \\
\Sigma_{s+1} &= (I + \tau^2 \Sigma_s^{-1})^{-1} \Sigma_s (I + \tau^2 \Sigma_s^{-1})^{-1} \\
&= P (I + \tau^2 \Lambda_s^{-1})^{-1} \Lambda_s (I + \tau^2 \Lambda_s^{-1})^{-1} P^T.
\end{aligned}
\]

Therefore, $\Sigma_s$ and $\Sigma_{s+1}$ share the same eigenvectors. Assume that the $\lambda_i^{(s)}$'s are
the eigenvalues of $\Sigma_s$ and the $\lambda_i^{(s+1)}$'s are those of $\Sigma_{s+1}$. Then

\[
\lambda_i^{(s+1)} = \frac{(\lambda_i^{(s)})^3}{(\lambda_i^{(s)} + \tau^2)^2}
\le \Big( \frac{(\lambda_i^{(0)})^2}{(\lambda_i^{(0)} + \tau^2)^2} \Big) \lambda_i^{(s)}.
\]

Therefore $\lambda_i^{(s)} \to 0$ as $s \to \infty$. For any $\epsilon$, there exists $t_0$ such that $\max_i \lambda_i^{(t_0)} <
\epsilon^2 / k$, where $k$ is a large integer. From (8), almost surely

\[
\sup_x |G_n^{(t_0)}(x) - G^{(t_0)}(x)| \to 0.
\]

Equivalently,

\[
\sup_A |G_n^{(t_0)}(A) - G^{(t_0)}(A)| \to 0,
\]

where $G_n^{(t_0)}(A)$ and $G^{(t_0)}(A)$ denote the probabilities of $x \in A$. Therefore, for
any $\delta > 0$, there exists $n_{t_0}$ such that

\[
\sup_A |G_n^{(t_0)}(A) - G^{(t_0)}(A)| < \delta
\]

for all $n > n_{t_0}$. Then

\[
\begin{aligned}
\Pr(\|x_{i,n}^{(t_0)}\|_2 > \epsilon) &= G_n^{(t_0)}(x^T x > \epsilon^2) \\
&\le G^{(t_0)}(x^T x > \epsilon^2) + \delta \\
&= G^{(t_0)}\Big(\frac{x^T x}{\max_i \lambda_i^{(t_0)}} > \frac{\epsilon^2}{\max_i \lambda_i^{(t_0)}}\Big) + \delta \\
&\le G^{(t_0)}(x^T \Sigma_{t_0}^{-1} x > k) + \delta \\
&= G(x^T x > k; I_p) + \delta,
\end{aligned}
\]

where $I_p$ is the identity matrix. This can be made arbitrarily small by choosing $k$ large
enough and $\delta$ small enough. Therefore, almost all updated data points are in
$B(0, \epsilon)$ at the $t_0$-th iteration, where $B(0, \epsilon) = \{x : \|x\|_2 < \epsilon\}$. For iterations $t > t_0$,
all updated data points within $B(0, \epsilon)$ will not move outside $B(0, \epsilon)$, since there
are more updated data points and hence more influence in the direction toward
zero. Therefore, $\|x_{i,n}^{(t)}\| < \epsilon$ for almost all $i$ and for all $t > t_0$ and $n > n_{t_0}$. By
Corollary 1, all data points will converge to a single location. We have

\[
\big\| \lim_{t \to \infty} x_{i,n}^{(t)} \big\|_2 < \epsilon
\]

for all $i$ when $n > n_{t_0}$, which completes the proof. □
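The covariance recursion used in the proof, and the resulting eigenvalue map $\lambda \mapsto \lambda^3 / (\lambda + \tau^2)^2$, can be checked numerically. The sketch below is an illustration only: the matrix `S0`, the value of `tau` and the helper name `next_cov` are assumptions chosen for this example.

```python
import numpy as np

def next_cov(S, tau):
    """Sigma_{s+1} = (I + tau^2 Sigma_s^{-1})^{-1} Sigma_s (I + tau^2 Sigma_s^{-1})^{-1}."""
    A = np.linalg.inv(np.eye(len(S)) + tau ** 2 * np.linalg.inv(S))
    return A @ S @ A

tau = 1.0
S0 = np.array([[2.0, 0.6],
               [0.6, 1.0]])               # an arbitrary positive-definite example
lam0 = np.linalg.eigvalsh(S0)             # eigenvalues in ascending order

S1 = next_cov(S0, tau)
lam1 = np.linalg.eigvalsh(S1)
predicted = lam0 ** 3 / (lam0 + tau ** 2) ** 2   # the map is increasing, so the order is kept

S = S0
for _ in range(4):                        # the eigenvalues shrink rapidly toward zero
    S = next_cov(S, tau)
largest_after_4 = np.linalg.eigvalsh(S).max()
```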



Remark 1. In this section, we present the results under the assumption that
both $f$ and $G$ are Normal. The results can be generalized to general second order
kernel functions with translation invariance. For this type of kernel function,
the empirical distribution at each iteration still converges to some distribution,
and the variance decreases through the iterations. The shrunk distribution,
however, may not have as nice a form as that in the Normal case.

Remark 2. If the data points are sampled from a finite mixture distribution,
the locations to which the data points converge through the iterative process may
not be consistent for the parameters. Take the mixture distribution $\alpha_1 N(\mu_1, 1) +
(1 - \alpha_1) N(\mu_2, 1)$ as an example. By choosing a proper $f$, data points will be
clustered into two groups. Since the domains of these two Normal distributions
overlap, the converged locations through the iterative process will not
converge to $\mu_1$ and $\mu_2$.



5 Simulation 

In this section we consider a one-dimensional case where the data are sampled
from $N(0, \sigma_0^2)$. The influence function is taken to be $f = \exp(-(x - y)^2 / 2\tau^2)$. We
used three experiments to compare the blurring and the nonblurring processes
in the following three aspects: the convergence rate, the efficiency, and the
robustness to outliers.

5.1 Convergence rate 

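Based on (9), we have shown that

\[
K_{G^{(s)}}(x) = \frac{\int_y f(x - y) \cdot y \, dG^{(s)}(y)}{\int_y f(x - y) \, dG^{(s)}(y)} = \frac{\sigma_s^2}{\sigma_s^2 + \tau^2}\, x.
\]

For the nonblurring process, the integration is over the original data, instead
of the updated data. The shrinkage ratio is therefore $\sigma_0^2 / (\sigma_0^2 + \tau^2) > \sigma_s^2 / (\sigma_s^2 + \tau^2)$, meaning that the
convergence rate of the blurring process is higher than that of the nonblurring
process. Take $\sigma_0 = 1$ and $\tau = 2$ as an example. For the blurring process,

\[
\begin{aligned}
\sigma_1 &= \sigma_0 \frac{\sigma_0^2}{\sigma_0^2 + \tau^2} = 1 \cdot \frac{1^2}{1^2 + 2^2} = 0.2, \\
\sigma_2 &= \sigma_1 \frac{\sigma_1^2}{\sigma_1^2 + \tau^2} = 0.2 \cdot \frac{0.2^2}{0.2^2 + 2^2} \approx 0.002, \\
\sigma_3 &= \sigma_2 \frac{\sigma_2^2}{\sigma_2^2 + \tau^2} = 0.002 \cdot \frac{0.002^2}{0.002^2 + 2^2} \approx 0.000000002.
\end{aligned}
\]

For the nonblurring process,

\[
\begin{aligned}
\sigma_1 &= \sigma_0 \frac{\sigma_0^2}{\sigma_0^2 + \tau^2} = 1 \cdot \frac{1^2}{1^2 + 2^2} = 0.2, \\
\sigma_2 &= \sigma_1 \frac{\sigma_0^2}{\sigma_0^2 + \tau^2} = 0.2 \cdot \frac{1^2}{1^2 + 2^2} = 0.04, \\
\sigma_3 &= \sigma_2 \frac{\sigma_0^2}{\sigma_0^2 + \tau^2} = 0.04 \cdot \frac{1^2}{1^2 + 2^2} = 0.008.
\end{aligned}
\]

[Figure 1 panels: (a) mean; (b) standard deviation; (c) log scale of s]

Figure 1: The simulation results on 100 samplings from N(0,1). The solid line
is from the blurring process, and the dashed line is from the nonblurring process.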



In this experiment, we sampled 100 data points from $N(0, 1)$. Fig. 1 presents
the simulation results of the blurring and the nonblurring processes. In detail,
Fig. 1(a) shows that both processes converged very close to the true mean
of zero. Fig. 1(b) shows that the standard deviations of the updated data
points dropped sharply at the first iteration and became nearly zero after the
second iteration. This illustrates that both processes converged very fast, while
the data points updated by the blurring process shrank much faster. Fig.
1(c) further presents the shrinkage of the updated data points in terms of the
log scale of the standard deviations in Fig. 1(b).
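The theoretical shrinkage of the standard deviation under the two processes can be tabulated with a short script. This is a minimal sketch of the one-dimensional recursions implied by (9), using $\sigma_0 = 1$ and $\tau = 2$; the function names are chosen here for illustration.

```python
def blurring_sd(sigma0, tau, steps):
    """sd after each blurring iteration: sigma_{s+1} = sigma_s * sigma_s^2 / (sigma_s^2 + tau^2)."""
    out, s = [], sigma0
    for _ in range(steps):
        s = s * s ** 2 / (s ** 2 + tau ** 2)
        out.append(s)
    return out

def nonblurring_sd(sigma0, tau, steps):
    """sd after each nonblurring iteration: the ratio stays sigma_0^2 / (sigma_0^2 + tau^2)."""
    ratio = sigma0 ** 2 / (sigma0 ** 2 + tau ** 2)
    return [sigma0 * ratio ** (s + 1) for s in range(steps)]

blur = blurring_sd(1.0, 2.0, 3)        # shrinks faster and faster
nonblur = nonblurring_sd(1.0, 2.0, 3)  # shrinks by the constant factor 1/5
```

The blurring recursion is roughly cubic once $\sigma_s$ is small relative to $\tau$, whereas the nonblurring recursion is only geometric.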



5.2 Efficiency 

In this experiment we consider $\tau$ to be 0.5, 1 or 2. For each $\tau$ value, we
simulated 100,000 sets of 100 data points, which were again sampled from $N(0, 1)$.
From the simulated 100,000 sets, we summarized the means and the
standard deviations of the following three statistics: the sample mean, the number
each set of data points converged to by the blurring process, and that by the
nonblurring process. The results are presented in Table 1.




Table 1: The mean and the standard deviation (in parentheses) of the converged points

τ      Sample Mean               Blurring                  Nonblurring
0.5    -1.897×10^-4 (0.1000)     -5.697×10^-4 (0.1210)     -5.349×10^-4 (0.2126)
1       1.260×10^-4 (0.0997)      2.400×10^-4 (0.1043)      4.185×10^-4 (0.1239)
2       6.352×10^-4 (0.0998)      5.842×10^-4 (0.1008)      5.565×10^-4 (0.1025)



There is no noticeable difference between the means of the three statistics.
We ran multiple 100,000-sample sets, and the orders (with respect to the
absolute value) are different for different sets. However, the standard deviations
of the three statistics were clearly different. The standard deviations of the
sample means were close to 0.1, which is the theoretical value. The standard
deviations of the numbers the data points converged to by the blurring
process were closer to those of the sample mean, and were smaller than those
by the nonblurring process. This suggests that the blurring process produced
more efficient estimates than the nonblurring process.

5.3 Robustness to outliers 

In this experiment, each data set has 95 data points sampled from $N(0, 1)$ and
another 5 data points from $N(5, 1)$. We consider $\tau$ to be 0.5, 1, or 2. For each
$\tau$ value, we simulated 100,000 data sets.

By Corollary 1, all data points should converge to a single number.
However, due to the floating point precision, the outliers which are far from most of the
data points may converge to different numbers. For both the blurring and the
nonblurring process, we take the number that most of the data points converged to
as the statistic. The results are presented in Table 2. While the sample mean
is no longer an unbiased estimator of the true mean when outliers are present,
Table 2 shows that the numbers most of the data points converged to by the
blurring and the nonblurring processes were still very close to the true mean
of zero. This suggests that both processes still produced good estimates
of the mean. The standard deviations produced by the blurring process were
again smaller than those by the nonblurring process.
Table 2: The mean and the standard deviation (in parentheses) of the converged points with 5%
outliers

τ      Sample Mean        Blurring            Nonblurring
0.5    0.2495 (0.1003)    -0.0006 (0.1241)    -0.0038 (0.2167)
1      0.2495 (0.1000)    -0.0106 (0.1102)     0.0002 (0.1276)
2      0.2503 (0.0998)     0.0928 (0.1046)     0.0220 (0.1080)



6 Discussion and Conclusion 

In this paper, we first give a rigorous mathematical proof of the convergence 
of the blurring mean-shift process. Our result is under the condition that / is 
PDD and w depends only on data points. 

We also prove the consistency of the blurring process, which ensures the 
estimation to converge to the true values of the parameters as the number of 
data points goes to infinity. Our consistency proof is for the Normal case, 
in which we could show the explicit form of the shrinkage rate of the data 
points. The consistency for more general kernel functions can be proven by
similar arguments.

From our simulation studies, both the blurring and the nonblurring processes 
have good robustness against outliers. The estimations by the blurring process 
usually yield smaller variances than those by the nonblurring process. 